The iCrawl Wizard - Supporting Interactive Focused Crawl Specification

Authors

  • Gerhard Gossen
  • Elena Demidova
  • Thomas Risse
Abstract

Collections of Web documents about specific topics are needed for many areas of current research. Focused crawling enables the creation of such collections on demand. Current focused crawlers require the user to manually specify starting points for the crawl (seed URLs), which also serve to describe the expected topic of the collection. The choice of seed URLs influences the quality of the resulting collection and requires considerable expertise. In this demonstration, we present the iCrawl Wizard, a tool that assists users in defining focused crawls efficiently and semi-automatically. Our tool uses major search engines and social media APIs as well as information extraction techniques to find seed URLs and a semantic description of the crawl intent. Using the iCrawl Wizard, even non-expert users can create semantic specifications for focused crawlers interactively and efficiently.

Similar resources

The iCrawl System for Focused and Integrated Web Archive Crawling

The large size of the Web makes it infeasible for many institutions to collect, store and process archives of the entire Web. Instead, many institutions focus on creating archives of specific subsets of the Web. These subsets may be based around specific topics or events. Our iCrawl system provides a focused crawler that is able to automatically collect Web pages relevant to a topic based on co...

New Media Collaboration through Wizard-of-Oz Simulations

The “Wizard of Oz” (WOz) method for supporting the evaluation of incomplete computer prototypes derives from the early days of the human-computer interaction community [5], and has subsequently been utilized by hundreds of researchers and designers to emulate technologies in interactive media systems [1,2]. Despite its popularity, we believe opportunities exist for expanding the method’s use th...

Focused Web Crawling: A Generic Framework for Specifying the User Interest and for Adaptive Crawling Strategies

Compared to the standard web search engines, focused crawlers yield good recall as well as good precision by restricting themselves to a limited domain. In this paper, we do not introduce another focused crawler, but we introduce a generic framework for focused crawling consisting of two major components: (1) Specification of the user interest and measuring the resulting relevance of a given we...

Contract Negotiation Wizard for VO Creation

The establishment of collaboration commitments, represented by contracts or agreements, is a crucial step in a virtual organization (VO) creation process. The contract negotiation shall proceed in parallel with the other phases of the VO creation process, namely preparatory planning, consortia formation, and VO launching. In each step specific elements for the contract / agreement are collected...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. An unfocused approach often produces undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Publication date: 2015